HW 03

Author

Brooke Pacheco

0 - Setup

# load packages
library(tidyverse)
library(glue)
library(here)
library(countdown)
library(ggthemes)
library(gt)
library(openintro)
library(ggrepel)
library(patchwork)
library(ggh4x)
library(jpeg)
library(ggpubr)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 11))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,        # 7" width
  fig.asp = 0.618,      # the golden ratio
  fig.retina = 3,       # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300             # higher dpi, sharper image
)

if (!require("pacman")) 
  install.packages("pacman")

# for installing/loading
pacman::p_load(tidyverse,
               glue,
               scales,
               ggthemes,
               ggh4x) 

devtools::install_github("tidyverse/dsbox")

1 - Du Bois challenge.

# Read in data from income file
income <- read_csv(here("data" ,"income.csv"))

# Load the image 
paper <- readJPEG(here("images" ,"paper.jpg"))

income <- income |>
  pivot_longer(
    cols = c(Rent, Food, Clothes, Tax, Other),
    names_to = "Type",
    values_to = "Total_Amount"
  ) |>
  mutate(
    Type = factor(Type, levels = c("Other", "Tax", "Clothes", "Food", "Rent")),
    Class = factor(Class, levels = c("$1000 AND OVER", "$750-1000", "$500-750", "$400-500", "$300-400", "$200-300", "$100-200")),
  ) |>
  group_by(Class) |>
  mutate(
    Class_Total = sum(Total_Amount),
    Percent_of_Class = Total_Amount / Class_Total
  ) |>
  ungroup()

# Create the histogram plot
ggplot(income, aes(x = Total_Amount, y = Class, fill = Type)) +
  annotation_raster(paper, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) +
  geom_col(position = "stack") +
  scale_fill_manual(
    values = c("Rent" = "#000000", "Food" = "#C3B1E1", "Clothes" = "#F89880", "Tax" = "#B7C9E2", "Other" = "#E6ECF5")) +
  geom_text(
    aes(
      label = ifelse(Percent_of_Class == 0, NA, percent(Percent_of_Class, accuracy = 1)),
      color = Type
    ),
    position = position_stack(vjust = 0.5),
    size = 3,
    show.legend = FALSE
  ) +
  scale_color_manual(
  values = c("Rent" = "white",
             "Food" = "black",
             "Clothes" = "black",
             "Tax" = "black",
             "Other" = "black")
  ) +
  labs(
    title = "Du Bois Plot Recreation",
    x = "Amount Spent",
    y = "Income Class"
  ) +
  theme_minimal(base_size = 11)

Sources

Hex color values: https://htmlcolorcodes.com/colors/

Referenced for percent function parameters: https://scales.r-lib.org/reference/percent_format.html

Referenced for ifelse usage: https://stat.ethz.ch/R-manual/R-devel/library/base/html/ifelse.html

To display the image I used: https://ggplot2.tidyverse.org/reference/annotation_raster.html

2 - COVID survey - interpret

Example 1:

For the response variable “Getting the vaccine will make me feel safer at work”, nurses had an average response close to 1, with a relatively narrow confidence interval. This indicates that most nurses agreed with the statement, and the estimate is fairly precise. Medical professionals, on the other hand, had a mean closer to 2 with a wider confidence interval. This suggests that while medical professionals also tended to agree, they were on average slightly less certain than nurses. Additionally, the wider confidence interval for medical professionals indicates greater variability in their responses and less precision in the estimate compared to the nurses. The responses did surprise me. I initially thought that medical professionals might be more confident than nurses because they typically have more training. However, on second thought, the confidence interval for the medical professionals group is much wider, suggesting greater variability and less precision in their responses.

Example 2:

For the response variable “I am concerned about the safety and side effects of the vaccine”, the Non-Hispanic/Non-Latino group has a slightly higher mean (around 3.5) with a relatively wide confidence interval. The Hispanic/Latino group has a mean closer to 3, also with a wide confidence interval. Notably, the confidence interval for the Non-Hispanic/Non-Latino group is slightly wider than that of the Hispanic/Latino group. These results suggest that, on average, both groups are neutral in their concern about the vaccine’s safety and side effects. However, the wide confidence intervals indicate a fair amount of uncertainty around these means, making the estimates less precise. Though the overlap between intervals suggests no statistically significant difference between the two groups. The results do not surprise me; regardless of ethnicity, I would expect both groups to have similar opinions.

Example 3:

For the response variable “I trust the information that I have received about the vaccines”, the group that reported having received the COVID vaccine had a mean close to 1, with a narrow confidence interval. This indicates strong agreement with the statement and a high level of precision in the estimate. In contrast, the group that reported not having received the vaccine had a mean around 3, with a wide confidence interval. This suggests that those who were unvaccinated were generally neutral in their trust of vaccine information, and the wide confidence interval reflects greater uncertainty and less precision in this group’s response. The results make sense; I would expect unvaccinated individuals to be less confident in the information compared to those who have received the vaccine.

3 - COVID survey - reconstruct

# Read in data from COVID survey file and skip first row
covid_survey <- read_csv(here("data", "covid-survey.csv"), skip = 1)

# Print dimensions of data frame
dim(covid_survey)
[1] 1121   14
# Data cleanup - eliminate any rows where all values aside from response_id are missing
cleaned_survey <- covid_survey |>
  filter(!if_all(-response_id, is.na))

# Print dimensions of cleaned data frame
dim(cleaned_survey)
[1] 1111   14
# Relabel the survey response values
relabeled_survey <- cleaned_survey |>
  mutate(
    exp_already_vax = factor(case_match(exp_already_vax, 0 ~ "No", 1 ~ "Yes")),
    exp_flu_vax = factor(case_match(exp_flu_vax, 0 ~ "No", 1 ~ "Yes")),
    exp_profession = factor(case_match(exp_profession, 0 ~ "Medical", 1 ~ "Nursing")),
    exp_gender = factor(case_match(exp_gender, 0 ~ "Male", 1 ~ "Female", 3 ~ "Non-binary/Third gender", 4 ~ "Prefer not to say")),
    exp_race = factor(case_match(exp_race, 1 ~ "American Indian /\nAlaskan Native", 2 ~ "Asian", 3 ~ "Black / African American", 4 ~ "Native Hawaiian / Other\nPacific Islander", 5 ~ "White")),
    exp_ethnicity = factor(case_match(exp_ethnicity, 1 ~ "Hispanic / Latino", 2 ~ "Non-Hispanic / Non-Latino")),
    exp_age_bin = factor(case_when(exp_age_bin == 0 ~ "<20", exp_age_bin == 20 ~ "21-25", exp_age_bin == 25 ~ "26-30", exp_age_bin == 30 ~ ">30", TRUE ~ as.character(exp_age_bin))
    )
  )

# Print dimensions of relabeled survey
dim(relabeled_survey)
[1] 1111   14
# Calculate the 10th percentile, mean, and 90th percentile of each of the response variables for each level of each explanatory variable.
covid_survey_longer <- relabeled_survey |>
  # Takes all the columns starting with "exp_" and creates them into two columns: explanatory and explanatory_value.
  pivot_longer(
    cols = starts_with("exp_"),
    names_to = "explanatory",
    values_to = "explanatory_value"
  ) |>
  filter(!is.na(explanatory_value)) |>
  # Takes all the columns starting with "resp_" and creates them into two columns: response and response_value.
  pivot_longer(
    cols = starts_with("resp_"),
    names_to = "response",
    values_to = "response_value"
  )

# Print dimensions of data frame and confirm tribble table matches homework 
covid_survey_longer
# A tibble: 43,428 × 5
   response_id explanatory    explanatory_value response         
         <dbl> <chr>          <fct>             <chr>            
 1           1 exp_profession Nursing           resp_safety      
 2           1 exp_profession Nursing           resp_confidence_…
 3           1 exp_profession Nursing           resp_concern_saf…
 4           1 exp_profession Nursing           resp_feel_safe_a…
 5           1 exp_profession Nursing           resp_will_recomm…
 6           1 exp_profession Nursing           resp_trust_info  
 7           1 exp_flu_vax    Yes               resp_safety      
 8           1 exp_flu_vax    Yes               resp_confidence_…
 9           1 exp_flu_vax    Yes               resp_concern_saf…
10           1 exp_flu_vax    Yes               resp_feel_safe_a…
# ℹ 43,418 more rows
# ℹ 1 more variable: response_value <dbl>
dim(covid_survey_longer)
[1] 43428     5
# Group data in covid_survey_longer
covid_survey_summary_stats_by_group <- covid_survey_longer |>
  group_by(explanatory, explanatory_value, response) |>
  summarise(
    mean = mean(response_value, na.rm = TRUE),
    low = quantile(response_value, 0.10, na.rm = TRUE),
    high = quantile(response_value, 0.90, na.rm = TRUE),
    .groups = "drop"
  )

# View the summary tibble
covid_survey_summary_stats_by_group
# A tibble: 126 × 6
   explanatory explanatory_value response        mean   low  high
   <chr>       <fct>             <chr>          <dbl> <dbl> <dbl>
 1 exp_age_bin <20               resp_concern_…  3.35     2   4.4
 2 exp_age_bin <20               resp_confiden…  1.65     1   2.4
 3 exp_age_bin <20               resp_feel_saf…  1.71     1   3.8
 4 exp_age_bin <20               resp_safety     1.41     1   2  
 5 exp_age_bin <20               resp_trust_in…  1.41     1   2  
 6 exp_age_bin <20               resp_will_rec…  1.35     1   1.8
 7 exp_age_bin >30               resp_concern_…  3.01     1   5  
 8 exp_age_bin >30               resp_confiden…  1.71     1   3  
 9 exp_age_bin >30               resp_feel_saf…  1.77     1   4  
10 exp_age_bin >30               resp_safety     1.85     1   4  
# ℹ 116 more rows
# Check dimensions
dim(covid_survey_summary_stats_by_group)
[1] 126   6
# Group data in covid_survey_longer only by response
covid_survey_summary_stats_all <- covid_survey_longer |>
  group_by(response) |>
  summarise(
    mean = mean(response_value, na.rm = TRUE),
    low  = quantile(response_value, 0.10, na.rm = TRUE),
    high = quantile(response_value, 0.90, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    explanatory = "All",
    explanatory_value = factor("")
  ) |>
  select(explanatory, explanatory_value, everything())

# View the result
covid_survey_summary_stats_all
# A tibble: 6 × 6
  explanatory explanatory_value response         mean   low  high
  <chr>       <fct>             <chr>           <dbl> <dbl> <dbl>
1 All         ""                resp_concern_s…  3.28     1     5
2 All         ""                resp_confidenc…  1.43     1     2
3 All         ""                resp_feel_safe…  1.36     1     2
4 All         ""                resp_safety      2.03     1     5
5 All         ""                resp_trust_info  1.40     1     2
6 All         ""                resp_will_reco…  1.21     1     2
# Check dimensions
dim(covid_survey_summary_stats_all)
[1] 6 6
# Bind both summary tables together
covid_survey_summary_stats <- bind_rows(
  covid_survey_summary_stats_all,
  covid_survey_summary_stats_by_group
)

# Check final dimensions 
dim(covid_survey_summary_stats)
[1] 132   6
# View final summary
covid_survey_summary_stats
# A tibble: 132 × 6
   explanatory explanatory_value response        mean   low  high
   <chr>       <fct>             <chr>          <dbl> <dbl> <dbl>
 1 All         ""                resp_concern_…  3.28     1   5  
 2 All         ""                resp_confiden…  1.43     1   2  
 3 All         ""                resp_feel_saf…  1.36     1   2  
 4 All         ""                resp_safety     2.03     1   5  
 5 All         ""                resp_trust_in…  1.40     1   2  
 6 All         ""                resp_will_rec…  1.21     1   2  
 7 exp_age_bin "<20"             resp_concern_…  3.35     2   4.4
 8 exp_age_bin "<20"             resp_confiden…  1.65     1   2.4
 9 exp_age_bin "<20"             resp_feel_saf…  1.71     1   3.8
10 exp_age_bin "<20"             resp_safety     1.41     1   2  
# ℹ 122 more rows
covid_survey_summary_stats <- covid_survey_summary_stats |>
  mutate(
    response = factor(response, levels = c(
      "resp_safety", "resp_feel_safe_at_work", "resp_concern_safety",
      "resp_confidence_science", "resp_trust_info", "resp_will_recommend"
    )),
    grouping = case_when(
      explanatory == "exp_already_vax" ~ "Had COVID\nvaccine",
      explanatory == "exp_flu_vax" ~ "Had flu\n vaccine this\nyear",
      explanatory_value == "" ~ "All",
      explanatory_value %in% c(">30", "26-30", "21-25", "<20") ~ "Age",
      explanatory_value %in% c("Prefer not to say", "Non-binary/Third gender", "Male", "Female") ~ "Gender",
      explanatory_value %in% c("White", "Native Hawaiian / Other\nPacific Islander", "Black / African American", "Asian", "American Indian /\nAlaskan Native") ~ "Race",
      explanatory_value %in% c("Non-Hispanic / Non-Latino", "Hispanic / Latino") ~ "Ethnicity",
      explanatory_value %in% c("Nursing", "Medical") ~ "Profession",
      TRUE ~ NA_character_
    ),
    grouping = factor(grouping, levels = c("All", "Age", "Gender", "Race", "Ethnicity", "Profession", "Had COVID\nvaccine", "Had flu\n vaccine this\nyear")),
    explanatory_value = factor(explanatory_value, levels = c("<20", "21-25", "26-30", ">30", "", "Female", "Male", "Non-binary/Third gender", "Prefer not to say", "American Indian /\nAlaskan Native", "Asian", "Black / African American", "Native Hawaiian / Other\nPacific Islander", "White", "Hispanic / Latino", "Non-Hispanic / Non-Latino", "Medical", "Nursing", "No", "Yes")),
  )

# define labels for response variables
response_labels <- c(
  "resp_safety" = "Based on my\nunderstanding, I\nbelieve the vaccine\nis safe",
  "resp_feel_safe_at_work" = "Getting the vaccine\nwill make me feel\nsafer at work",
  "resp_concern_safety" = "I am concerned\nabout the safety\nand side effects of\nthe vaccine",
  "resp_confidence_science" = "I am confident in\nthe scientific\nvetting process for\n the new COVID\nvaccines",
  "resp_trust_info" = "I trust the\ninformation that I\nhave received about\nthe vaccines",
  "resp_will_recommend" = "I will recommend\nthe vaccine to\nfamily, friends,\nand community\nmembers"
)

ggplot(covid_survey_summary_stats, aes(x = mean, y = explanatory_value)) +
  geom_point(size = 0.6) +
  geom_errorbarh(aes(xmin = low, xmax = high), height = 0.3, color = "black", size = 0.3) +
  scale_y_discrete(expand = expansion(mult = c(0.05, 0.05))) +
  facet_grid(
    grouping ~ response, 
    scales = "free_y", 
    space = "free_y",
    labeller = labeller(response = as_labeller(response_labels))
    ) +
  labs(
    title = "Mean Likert Scores by Group and Survey Question",
    x = "Mean Likert Score\nError bars range from 10th to 90th percentile",
    y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.spacing = unit(0.1, "lines"),
    strip.background = element_rect(fill = "grey90", color = "black"),
    axis.text.y = element_text(size = 5),
    strip.text.y.right = element_text(size = 4, angle = pi / 6, margin = margin(t = 80, r = 1, b = 80, l = 1)),
    strip.text.x = element_text(margin = margin(t = 5, r = 0, b = 5, l = 0), size = 6)
  )

Sources

To not select a column, used ‘-’ directly from: https://stackoverflow.com/questions/49582602/how-not-to-select-columns-using-select-dplyr-when-you-have-character-vector-of

Directly referenced to remove NA data before evaluating expression: https://stat.ethz.ch/R-manual/R-devel/library/base/html/mean.html

Referenced for quantile function: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

Referenced for ticks at end of bars: https://ggplot2.tidyverse.org/reference/geom_errorbarh.html

Referenced to modify the theme: https://ggplot2.tidyverse.org/reference/theme.html

Referenced to provide more space between rows: https://ggplot2.tidyverse.org/reference/scale_discrete.html

4 - COVID survey - re-reconstruct

# Group data in covid_survey_longer
covid_survey_summary_stats_by_group_25_75 <- covid_survey_longer |>
  group_by(explanatory, explanatory_value, response) |>
  summarise(
    mean = mean(response_value, na.rm = TRUE),
    low = quantile(response_value, 0.25, na.rm = TRUE),
    high = quantile(response_value, 0.75, na.rm = TRUE),
    .groups = "drop"
  )

# Group data in covid_survey_longer only by response
covid_survey_summary_stats_all_25_75 <- covid_survey_longer |>
  group_by(response) |>
  summarise(
    mean = mean(response_value, na.rm = TRUE),
    low  = quantile(response_value, 0.25, na.rm = TRUE),
    high = quantile(response_value, 0.75, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    explanatory = "All",
    explanatory_value = factor("")
  ) |>
  select(explanatory, explanatory_value, everything())

# Bind both summary tables together
covid_survey_summary_stats_25_75 <- bind_rows(
  covid_survey_summary_stats_all_25_75,
  covid_survey_summary_stats_by_group_25_75
)

covid_survey_summary_stats_25_75 <- covid_survey_summary_stats_25_75 |>
  mutate(
    response = factor(response, levels = c(
      "resp_safety", "resp_feel_safe_at_work", "resp_concern_safety",
      "resp_confidence_science", "resp_trust_info", "resp_will_recommend"
    )),
    grouping = case_when(
      explanatory == "exp_already_vax" ~ "Had COVID\nvaccine",
      explanatory == "exp_flu_vax" ~ "Had flu\n vaccine this\nyear",
      explanatory_value == "" ~ "All",
      explanatory_value %in% c(">30", "26-30", "21-25", "<20") ~ "Age",
      explanatory_value %in% c("Prefer not to say", "Non-binary/Third gender", "Male", "Female") ~ "Gender",
      explanatory_value %in% c("White", "Native Hawaiian / Other\nPacific Islander", "Black / African American", "Asian", "American Indian /\nAlaskan Native") ~ "Race",
      explanatory_value %in% c("Non-Hispanic / Non-Latino", "Hispanic / Latino") ~ "Ethnicity",
      explanatory_value %in% c("Nursing", "Medical") ~ "Profession",
      TRUE ~ NA_character_
    ),
    grouping = factor(grouping, levels = c("All", "Age", "Gender", "Race", "Ethnicity", "Profession", "Had COVID\nvaccine", "Had flu\n vaccine this\nyear")),
    explanatory_value = factor(explanatory_value, levels = c("<20", "21-25", "26-30", ">30", "", "Female", "Male", "Non-binary/Third gender", "Prefer not to say", "American Indian /\nAlaskan Native", "Asian", "Black / African American", "Native Hawaiian / Other\nPacific Islander", "White", "Hispanic / Latino", "Non-Hispanic / Non-Latino", "Medical", "Nursing", "No", "Yes")),
  )

ggplot(covid_survey_summary_stats_25_75, aes(x = mean, y = explanatory_value)) +
  geom_point(size = 0.6) +
  geom_errorbarh(aes(xmin = low, xmax = high), height = 0.3, color = "black", size = 0.3) +
  scale_y_discrete(expand = expansion(mult = c(0.05, 0.05))) +
  facet_grid(
    grouping ~ response, 
    scales = "free_y", 
    space = "free_y",
    labeller = labeller(response = as_labeller(response_labels))
    ) +
  labs(
    title = "Mean Likert Scores by Group and Survey Question",
    x = "Mean Likert Score\nError bars range from 25th to 75th percentile",
    y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.spacing = unit(0.1, "lines"),
    strip.background = element_rect(fill = "grey90", color = "black"),
    axis.text.y = element_text(size = 5),
    strip.text.y.right = element_text(size = 4, angle = pi / 6, margin = margin(t = 80, r = 1, b = 80, l = 1)),
    strip.text.x = element_text(margin = margin(t = 5, r = 0, b = 5, l = 0), size = 6)
  )

The previous plot, which used the 10th to 90th percentiles, displayed the middle 80% of the data. This provided a view of most of the response distribution, observing a wide range of answers while excluding the most extreme values. In contrast, the 25th to 75th percentile range shows the middle 50% of the data. This range represents the most typical or frequently occurring values with less influence from outliers.

As a result, the plots differ. The plot using the 25th to 75th percentile generally displays narrower confidence intervals, making the data more precise. This suggests that while there’s variability in responses, the central half of responses is more consistent than the wider middle 80% range.

In both the 25th to 75th percentile plot and the 10th to 90th percentile plot, for the response “Getting the vaccine will make me feel safer at work,” nurses had an average response close to 1, with a relatively narrow confidence interval. This indicates strong agreement among nurses and a precise estimate of their average response. The confidence interval becomes even narrower in the 25th to 75th percentile plot, reflecting reduced influence from outliers.

Medical professionals, by comparison, had a mean response closer to 2. While their confidence interval also narrows in the 25th to 75th percentile plot, it remains wider than that of the nurses, indicating greater variability in their responses. This suggests that, although medical professionals tended to agree with the statement, they were on average slightly less certain than nurses.

Overall, while the means remain similar across both plots, the 25th to 75th percentile plot offers tighter confidence intervals, highlighting the middle 50% of responses with less influence from outliers and thereby improving precision.

5 - COVID survey - another view

(A)

# pivot responses to long format
likert_long <- relabeled_survey |>
  pivot_longer(
    cols = starts_with("resp_"),
    names_to = "question",
    values_to = "response_value"
  ) |>
  filter(!is.na(response_value))

# calculate percentage of each response value for each question
likert_summary <- likert_long |>
  group_by(question, response_value) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(question) |>
  mutate(percentage = n / sum(n) * 100)

# convert to signed percenatges
likert_summary <- likert_summary |>
  mutate(
    perc_signed = case_when(
      response_value %in% c("1", "2") ~ percentage,
      response_value %in% c("3") ~ 0,
      response_value %in% c("4", "5") ~ -percentage
    )
  )

# response variable is numeric and will need to be a factor for scale_fill_manual in the plot
likert_summary <- likert_summary |>
  mutate(response_value = factor(response_value, levels = c("1", "2", "3", "4", "5")))


# define labels for response variables
response_labels <- c(
  "resp_safety" = "Based on my understanding,\nI believe the vaccine is safe",
  "resp_feel_safe_at_work" = "Getting the vaccine will\nmake me feel safer at work",
  "resp_concern_safety" = "I am concerned about the safety\nand side effects of the vaccine",
  "resp_confidence_science" = "I am confident in the scientific vetting\nprocess for the new COVID vaccines",
  "resp_trust_info" = "I trust the information that I have\nreceived about the vaccines",
  "resp_will_recommend" = "I will recommend the vaccine to family,\nfriends, and  community members"
)

ggplot(likert_summary, aes(x = perc_signed, y = question, fill = response_value)) + 
  geom_col(position = position_stack(reverse = TRUE), width = 0.6) +
  scale_fill_manual(
    values = c( "1" = "#b2182b", "2" = "#ef8a62","3" = "#7aa381", "4" = "#67a9cf", "5" = "#2166ac"),
    name = "Likert Response",
    labels = c("1", "2", "3", "4", "5")
  ) +
  scale_y_discrete(labels = response_labels) +
  scale_x_continuous(labels = abs) +  # Show positive labels on x-axis
  labs(
    title = "COVID Vaccine Survey Responses",
    x = "Percentage",
    y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(size = 8),
    legend.position = "bottom"
  )

The diverging bar chart displays the percentage of responses for each COVID vaccine survey question. The y-axis lists the survey questions, while the x-axis represents the percentage of likert responses. For each question, agree responses (‘1’ and ‘2’) extend to the right, neutral responses (‘3’) are centered at zero, and disagree responses (‘4’ and ‘5’) extend to the left. The length of each bar reflects the percentage of selecting that option. From the chart, it is clear that a large majority of people tended to agree with most of the survey statements. However, there is an exception in the question regarding concern about the safety and side effects of the vaccine with a majority of responses not agreeing.

(B)

# Pivot responses to long format
likert_long <- relabeled_survey |>
  pivot_longer(
    cols = starts_with("resp_"),
    names_to = "question",
    values_to = "response_value"
  ) |>
  filter(!is.na(response_value))

# Calculate percentage of each response value for each question
likert_summary <- likert_long |>
  group_by(question, response_value) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(question) |>
  mutate(percentage = n / sum(n) * 100)

likert_summary <- likert_summary |>
  mutate(
    question = factor(question, levels = c(
      "resp_safety", "resp_feel_safe_at_work", "resp_concern_safety",
      "resp_confidence_science", "resp_trust_info", "resp_will_recommend"
    )),
    response_value = factor(response_value, levels = c("1", "2", "3", "4", "5"))
  )

# define labels for response variables
response_labels <- c(
  "resp_safety" = "Based on my understanding,\nI believe the vaccine is safe",
  "resp_feel_safe_at_work" = "Getting the vaccine will\nmake me feel safer at work",
  "resp_concern_safety" = "I am concerned about the safety\nand side effects of the vaccine",
  "resp_confidence_science" = "I am confident in the scientific vetting\nprocess for the new COVID vaccines",
  "resp_trust_info" = "I trust the information that I have\nreceived about the vaccines",
  "resp_will_recommend" = "I will recommend the vaccine to family,\nfriends, and  community members"
)

# Plot diverging bar chart
ggplot(likert_summary, aes(x = percentage, y = question, fill = response_value)) + 
  geom_col(position = position_stack(reverse = TRUE), width = 0.6) +
  scale_fill_manual(
    values = c( "1" = "#b2182b", "2" = "#ef8a62","3" = "#7aa381", "4" = "#67a9cf", "5" = "#2166ac"),
    name = "Likert Response",
    labels = c("1", "2", "3", "4", "5")
  ) +
  scale_y_discrete(labels = response_labels) + 
  labs(
    title = "COVID Vaccine Survey Responses",
    x = "Percentage",
    y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(size = 8),
    legend.position = "bottom"
  )

The 100% bar chart shows the percentage distribution of responses to COVID vaccine survey questions. The colored segments indicate response values from 1 (strongly agree) to 5 (strongly disagree). The chart reveals varying levels of agreement across questions, with generally higher agreement recommending the vaccine to others and lower agreement on vaccine safety concerns.

To compare the bars, it was visually easier to interpret the diverging bar chart. It clearly showed that the majority of people agreed with the survey questions. While the 100% stacked bar chart also indicated that most respondents agreed, it was much more difficult to compare the distribution of individual likert responses, especially those with smaller percentages. The diverging bar chart made it easier to visually assess both the overall trends and the relative size of less popular response categories.

Sources

Summarise each group into single row, directly used: https://dplyr.tidyverse.org/reference/summarise.html

To get info on the group reference: https://dplyr.tidyverse.org/reference/context.html

To select colors used: https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3

Referenced to stack bars on top of each other: https://ggplot2.tidyverse.org/reference/position_stack.html

To adjust geom_bar I referenced used: https://ggplot2.tidyverse.org/reference/geom_bar.html